New Approaches Towards Robust and Adaptive Speech Recognition

Authors

  • Herve Bourlard
  • Samy Bengio
  • Katrin Weber
Abstract

In this paper, we discuss some new research directions in automatic speech recognition (ASR) which somewhat deviate from the usual approaches. More specifically, we motivate and briefly describe new approaches based on multi-stream and multi-band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent) "experts", each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension of multi-stream ASR, we finally introduce a new approach, referred to as HMM2, where the HMM emission probabilities are estimated via state-specific, feature-based HMMs responsible for merging the stream information and modeling their possible correlation.

1 Multi-Channel Processing in ASR

Current automatic speech recognition systems are based on (context-dependent or context-independent) phone models described in terms of a sequence of hidden Markov model (HMM) states, where each HMM state is assumed to be characterized by a stationary probability density function. Furthermore, time correlation, and consequently the dynamics of the signal, inside each HMM state is also usually disregarded (although the use of temporal delta and delta-delta features can capture some of this correlation). Consequently, only medium-term dependencies are captured via the topology of the HMM, while short-term and long-term dependencies are usually very poorly modeled. Ideally, we want to design a particular HMM able to accommodate multiple time-scale characteristics, so that we can capture phonetic properties as well as syllable structures and (long-term) invariants that are more robust to noise. It is, however, clear that those different time-scale features will also exhibit different levels of stationarity and will require different HMM topologies to capture their dynamics. There are many potential advantages to such a multi-stream approach, including:

1. The definition of a principled way to merge different temporal knowledge sources, such as acoustic and visual inputs, even if the temporal sequences are not synchronous and do not have the same data rate (see [13] for further discussion).

2. The possibility of incorporating multiple time resolutions (as part of a structure with multiple unit lengths, such as phone and syllable).

3. As a particular case of multi-stream processing, multi-band ASR [2, 5], involving the independent processing and combination of partial frequency bands, has many potential advantages, briefly discussed below.

In the following, we will discuss neither the underlying algorithms (more or less "complex" variants of Viterbi decoding) nor detailed experimental results (see, e.g., [4] for recent results). Instead, we will mainly focus on the combination strategy and discuss different variants around the same formalism.

2 Multiband-based ASR

2.1 General Formalism

As a particular case of the multi-stream paradigm, we have been investigating an ASR approach based on independent processing and combination of frequency subbands. The general idea, as illustrated in Fig. 1, is to split the whole frequency band (represented in terms of critical bands) into a few subbands, on which different recognizers are independently applied.
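As a rough illustration of this splitting step (a sketch only, not the authors' implementation; the band count, array shapes, and function name below are assumptions made for the example), a critical-band spectrogram can simply be partitioned along its frequency axis into K contiguous subband streams:

```python
import numpy as np

def split_into_subbands(spectrogram, k):
    """Split a critical-band spectrogram of shape (frames, bands) into
    k contiguous subband streams along the frequency axis."""
    # np.array_split tolerates band counts that are not divisible by k.
    return np.array_split(spectrogram, k, axis=1)

# Toy usage: 100 frames, 15 critical bands, split into K = 4 streams.
critical_band_spectrogram = np.random.randn(100, 15)
streams = split_into_subbands(critical_band_spectrogram, 4)
for i, stream in enumerate(streams):
    print(f"stream {i}: shape {stream.shape}")  # (100, 4) or (100, 3)
```

Each of the K streams would then be passed to its own sub-recognizer.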
The resulting probabilities are then combined for recognition later in the process at some segmental level. Starting from critical bands, acoustic processing is now performed independently for each frequency band, yielding K input streams, each being associated with a particular frequency band.

Figure 1: Typical multiband-based ASR architecture. In multi-band speech recognition, the frequency range of the speech signal spectrogram is split into several bands, and the information in each band is used for phonetic probability estimation by an independent module. These probabilities are then combined for recognition later in the process at some segmental level, yielding the recognized word.

In this case, each of the K sub-recognizers (channels) uses the information contained in a specific frequency band $X^k = \{x_1^k, x_2^k, \ldots, x_n^k, \ldots, x_N^k\}$, where each $x_n^k$ represents the acoustic (spectral) vector at time $n$ in the $k$-th stream. In the case of hybrid HMM/ANN systems, HMM local emission (posterior) probabilities are estimated by an artificial neural network (ANN) estimating $P(q_j|x_n)$, where $q_j$ is an HMM state and $x_n = (x_n^1, \ldots, x_n^k, \ldots, x_n^K)^t$ the feature vector at time $n$. In the case of multi-stream (or subband-based) HMM/ANN systems, different ANNs will compute state-specific stream posteriors $P(q_j|x_n^k)$. Combination of these local posteriors can then be performed at different temporal levels and in many ways, including [2]: untrained or trained linear functions (e.g., as a function of automatically estimated local SNR), as well as trained nonlinear functions (e.g., using a neural network). In the simplest case, this subband posterior recombination is performed at the HMM state level, which then amounts to performing a standard Viterbi decoding in which local (log) probabilities are obtained from a linear or nonlinear combination of the local subband probabilities. For example, in the initial subband-based ASR, local posteriors $P(q_j|x_n)$ were estimated according to:

$$P(q_j|x_n) = \sum_{k=1}^{K} w_k \, P(q_j|x_n^k, \Theta_k) \qquad (1)$$
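As a minimal sketch of this state-level recombination (assuming the subband posteriors and weights are given; the function name, array shapes, and the Dirichlet toy data below are illustrative assumptions, not from the paper), equation (1) followed by a switch to the log domain, as a standard Viterbi decoder would consume it, could look like:

```python
import numpy as np

def combine_subband_posteriors(stream_posteriors, weights):
    """Linear recombination of subband state posteriors as in Eq. (1):
        P(q_j | x_n) = sum_k w_k * P(q_j | x_n^k, Theta_k)

    stream_posteriors: array of shape (K, N, J) -- posteriors from the
                       K subband ANNs for N frames and J HMM states.
    weights:           length-K combination weights w_k.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # keep the combined posteriors normalized
    combined = np.tensordot(w, stream_posteriors, axes=(0, 0))  # (N, J)
    # Log-domain local scores for a standard Viterbi decoding pass.
    return np.log(combined + 1e-12)

# Toy usage: K = 4 streams, N = 100 frames, J = 30 HMM states, equal weights.
posteriors = np.random.dirichlet(np.ones(30), size=(4, 100))  # (4, 100, 30)
log_scores = combine_subband_posteriors(posteriors, np.ones(4))
print(log_scores.shape)  # (100, 30)
```

The weights $w_k$ may be left untrained (e.g., uniform) or trained, for instance as a function of automatically estimated local SNR; as noted above, the linear combination can also be replaced entirely by a trained nonlinear function such as a neural network.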





Publication date: 2000